Acta Psychiatrica Scandinavica
○ Wiley
Preprints posted in the last 30 days, ranked by how well they match Acta Psychiatrica Scandinavica's content profile, based on 10 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit.
Trivedi, S.; Simons, N. W.; Tyagi, A.; Ramaswamy, A.; Nadkarni, G. N.; Charney, A. W.
Background: Large language models (LLMs) are increasingly used in mental health contexts, yet their detection of suicidal ideation is inconsistent, raising patient safety concerns. Objective: To evaluate whether an independent safety monitoring system improves detection of suicide risk compared with native LLM safeguards. Methods: We conducted a cross-sectional evaluation using 224 paired suicide-related clinical vignettes presented in a single-turn format under two conditions (with and without structured clinical information). Native LLM safeguard responses were compared with an independent supervisory safety architecture with asynchronous monitoring. The primary outcome was detection of suicide risk requiring intervention. Results: The supervisory system detected suicide risk in 205 of 224 evaluations (91.5%) versus 41 of 224 (18.3%) for native LLM safeguards. Among 168 discordant evaluations, 166 favored the supervisory system and 2 favored the LLM (matched odds ratio ≈ 83.0). Both systems detected risk in 39 evaluations, and neither in 17. Detection was highest in scenarios with explicit suicidal ideation and lower in more ambiguous presentations. Conclusions: Native LLM safeguards frequently failed to detect suicide risk in this structured evaluation. An independent monitoring approach substantially improved detection, supporting the role of external safety systems in high-risk mental health applications of LLMs.
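The paired counts reported in this abstract are enough to reproduce its headline figures. The sketch below is only an arithmetic check on the published numbers (variable names are illustrative, not the authors'): the marginal detection rates come from the full 2x2 paired table, and the matched odds ratio is the ratio of the two discordant cells.

```python
# Paired 2x2 counts as reported in the abstract (224 paired evaluations).
both_detected = 39       # supervisory system and native LLM safeguard both flagged risk
supervisor_only = 166    # discordant: only the supervisory system flagged risk
llm_only = 2             # discordant: only the native LLM safeguard flagged risk
neither_detected = 17    # neither system flagged risk

total = both_detected + supervisor_only + llm_only + neither_detected

# Marginal detection rate of each system over all paired evaluations.
supervisor_rate = (both_detected + supervisor_only) / total
llm_rate = (both_detected + llm_only) / total

# The matched (conditional) odds ratio uses only the discordant pairs.
matched_odds_ratio = supervisor_only / llm_only

print(f"n = {total}")
print(f"supervisory detection = {supervisor_rate:.1%}")
print(f"native LLM detection  = {llm_rate:.1%}")
print(f"matched odds ratio    = {matched_odds_ratio:.1f}")
```

Concordant pairs (39 and 17) do not enter the matched odds ratio, which is why it can be very large even though both systems agree on a quarter of the evaluations.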
Flathers, M.; Nguyen, P. A. H.; Herpertz, J.; Granof, M.; Ryan, S. J.; Wentworth, L.; Moutier, C. Y.; Torous, J.
Background: Millions of people use language models to discuss mental health concerns, including suicidal ideation, but limited frameworks exist for evaluating whether these systems respond safely. Benchmarking, the practice of administering standardized assessments to language models, offers direct parallels to clinical competency evaluation, yet few clinicians are involved in designing, validating, or interpreting these assessments. Aims: To introduce mental health professionals to benchmarking language models by administering a validated clinical instrument and demonstrating how configuration decisions, measurement limitations, and scoring context affect result interpretation. Method: We administered the Suicide Intervention Response Inventory (SIRI-2) programmatically to nine commercially available language models from three providers. Each item was presented 60 times per model (three prompt variants x two temperature settings x 10 repetitions), yielding 27,000 model responses compared against point-in-time expert consensus. Results: Total scores ranged from 19.5 to 84.0 (expert panel baseline: 32.5). Prompt design alone shifted individual model scores by as much as the difference between trained and untrained human groups. The best performing model approached the instrument's measurement floor. All nine models consistently overrated clinically inappropriate responses that sounded supportive. Conclusions: A single benchmark score can support markedly different claims depending on the assumed standard of clinical behavior, the instrument's remaining measurement range, and the configuration that produced the result. The skills required to make these distinctions must become core competencies. Benchmark results are increasingly utilized to support claims about mental health safety that may not be accurate, making it necessary to close the gap between clinical measurement and AI.
Plain Language Summary: AI chatbots like ChatGPT, Claude, and Gemini are increasingly used by millions of people to discuss mental health problems, including thoughts of suicide. To assess whether these systems handle such conversations safely, researchers give them standardized tests called benchmarks and compare their answers to those of human experts. These scores are already used to argue AI systems are ready for clinical use. This study gave a well-established test of suicide response skills to nine AI models from three major companies under varying conditions. We changed how much instruction the AI received and how much randomness was built into its responses, then measured whether the scores changed. The same AI model could score like a trained crisis counselor under one set of conditions and like an untrained undergraduate under another, depending on choices the person running the test made. Every model also made the same kind of mistake: responses that sounded warm and caring were rated as appropriate, even when experts had judged them to be clinically problematic. The highest-scoring model performed so well that the test could no longer measure whether it was truly skilled or had simply exceeded the test's range. These findings show that a single score can be misleading without knowing how the test was run, whether it can still distinguish strong from weak performance, and whether it matches what the AI is used for. Mental health professionals routinely make these judgments about clinical assessments and are well positioned to bring that expertise to AI evaluation.
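The benchmark design in the abstract above (three prompt variants, two temperature settings, 10 repetitions, nine models, 27,000 responses) can be laid out as a configuration grid. The sketch below is an illustration only: the prompt and temperature labels are hypothetical placeholders, and the number of scored units per model is back-calculated from the stated totals, since the abstract does not give it directly.

```python
from itertools import product

# Factors stated in the abstract; the labels are hypothetical placeholders.
prompt_variants = ["prompt_a", "prompt_b", "prompt_c"]
temperature_settings = ["temperature_low", "temperature_high"]
repetitions = range(10)

n_models = 9
total_responses = 27_000

# Every (prompt, temperature, repetition) combination is one presentation of an item.
configurations = list(product(prompt_variants, temperature_settings, repetitions))
presentations_per_item = len(configurations)

# Back-calculated: how many scored units per model the stated totals imply.
implied_units_per_model = total_responses // (n_models * presentations_per_item)

print(presentations_per_item)   # presentations of each item per model
print(implied_units_per_model)  # scored units implied by 27,000 responses
```

Laying the grid out this way makes the abstract's central point concrete: a single reported score corresponds to one cell (or one average over cells) of this grid, so the configuration has to be reported alongside it.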
Shi, Z.; Youngstrom, E. A.; Liu, Y.; Youngstrom, J. K.; Findling, R. L.
Pediatric bipolar disorder is challenging to diagnose accurately due to symptom heterogeneity. More standardized and data-driven approaches are needed to enhance diagnostic reliability. We evaluated a clinical decision tool (nomogram), statistical methods (logistic regression, LASSO), machine learning methods (support vector machine, random forest, k-nearest neighbors, extreme gradient boosting), and a deep learning model (multilayer perceptron) for pediatric bipolar disorder prediction across two datasets collected in academic (N=550) and community (N=511) clinical settings. We compared three modeling strategies: cross-dataset validation, cross-dataset with interaction terms, and mixed-dataset. We assessed model performance using discrimination ability, calibration, and predictor importance ranking. In the baseline cross-dataset approach, all models showed good internal discrimination in the academic dataset, but external discrimination in the community dataset declined substantially. Interaction-enhanced models slightly improved internal discrimination but not external performance or calibration. Recalibration markedly improved cross-dataset calibration without compromising discrimination, indicating that transportability problems were largely driven by probability scaling. Models trained on mixed datasets exhibited much stronger external discrimination and calibration. Across models and training strategies, family risk and PGBI-10M were consistently ranked as the most important predictors. Predictive models for pediatric bipolar disorder showed strong internal performance but limited cross-setting generalizability due to dataset shift and miscalibration. Increasing model complexity did not improve external performance, whereas training on pooled data substantially improved both discrimination and calibration.
Findings suggest that sampling diversity, rather than model complexity, is more valuable for developing clinically useful and generalizable psychiatric prediction models, underscoring the importance of open and collaborative datasets.
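The abstract's observation that recalibration improved calibration "without compromising discrimination" follows from a general property: a monotone rescaling of predicted probabilities cannot change their rank order, so AUC is unaffected. The sketch below illustrates this with intercept-only recalibration on the logit scale. The method choice and the data are assumptions made for illustration; the abstract does not say which recalibration procedure the authors used.

```python
import math

def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def recalibrate_intercept(labels, probs, iterations=60):
    """Add one constant on the logit scale so that the mean predicted
    probability matches the observed event rate (found by bisection)."""
    target = sum(labels) / len(labels)
    low, high = -10.0, 10.0
    for _ in range(iterations):
        shift = (low + high) / 2
        mean_pred = sum(sigmoid(logit(p) + shift) for p in probs) / len(probs)
        if mean_pred < target:
            low = shift
        else:
            high = shift
    shift = (low + high) / 2
    return [sigmoid(logit(p) + shift) for p in probs]

# Hypothetical external-site data: predictions are systematically too high.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.70, 0.80, 0.90]

recalibrated = recalibrate_intercept(labels, probs)

print("observed rate     :", sum(labels) / len(labels))
print("mean pred (before):", sum(probs) / len(probs))
print("mean pred (after) :", sum(recalibrated) / len(recalibrated))
print("AUC before / after:", auc(labels, probs), auc(labels, recalibrated))
```

In this toy example the mean prediction moves from well above the observed rate to matching it, while the AUC is identical before and after, which is the pattern the abstract attributes to "probability scaling".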
Provaznikova, B.; de Bardeci, M.; Altamiranda, E.; Ip, C.-T.; Monn, A.; Weber, S.; Jungwirth, J.; Rohde, J.; Prinz, S.; Kronenberg, G.; Bruehl, A.; Bracht, T.; Olbrich, S.
Objective: Major depressive episodes frequently show limited response to first-line treatments, motivating the search for objective biomarkers. EEG/ECG-based support tools aggregating electrophysiological predictors may guide treatment selection. We examined whether antidepressant treatments concordant with an EEG/ECG-biomarker report were associated with higher response rates. Methods: We retrospectively analyzed adults with ICD-10 depressive disorder or bipolar depression treated with electroconvulsive therapy (ECT), repetitive transcranial magnetic stimulation (rTMS), (es)ketamine, or selective serotonin reuptake inhibitors (SSRIs) between 2022 and 2024. Resting-state EEG with simultaneous ECG generated individualized biomarker reports with modality-specific response likelihoods. Treatment chosen by clinical teams was classified as concordant or non-concordant; response was derived from routinely collected clinical scales. Results: Among 153 patients (ECT n=53, rTMS n=48, (es)ketamine n=36, SSRIs n=16), response rates were higher for concordant vs non-concordant treatments: ECT 70% vs 50%, rTMS 30% vs 13%, (es)ketamine 31% vs 10%, and SSRIs 100% vs 11%. Overall, 46% (42/92) of concordant vs. 26% (14/54) of non-concordant patients responded (absolute difference +20 percentage points; relative increase ≈ 77%; number needed to treat ≈ 5). Conclusion: Concordance with EEG/ECG biomarkers correlated with higher treatment response, warranting confirmation in prospective trials. Significance: EEG/ECG-based decision support may enhance antidepressant treatment response in everyday clinical practice.
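The overall effect measures in this abstract can be recomputed from the reported counts (42/92 and 14/54). The sketch below is an arithmetic check only, with illustrative variable names. It suggests that the quoted relative increase of about 77% corresponds to the rounded percentages (46% vs 26%), whereas the unrounded counts give roughly 76%.

```python
# Overall responder counts as reported in the abstract.
concordant_responders, concordant_total = 42, 92
nonconcordant_responders, nonconcordant_total = 14, 54

rate_concordant = concordant_responders / concordant_total
rate_nonconcordant = nonconcordant_responders / nonconcordant_total

# Absolute difference in response rates and its reciprocal.
absolute_difference = rate_concordant - rate_nonconcordant
number_needed_to_treat = 1 / absolute_difference

# Relative increase from the unrounded rates and from the rounded percentages.
relative_increase_exact = rate_concordant / rate_nonconcordant - 1
relative_increase_rounded = round(rate_concordant * 100) / round(rate_nonconcordant * 100) - 1

print(f"concordant response     = {rate_concordant:.1%}")
print(f"non-concordant response = {rate_nonconcordant:.1%}")
print(f"absolute difference     = {absolute_difference * 100:.1f} percentage points")
print(f"relative increase       = {relative_increase_exact:.1%} (exact), "
      f"{relative_increase_rounded:.1%} (from rounded percentages)")
print(f"number needed to treat  = {number_needed_to_treat:.2f}")
```

The two overall denominators sum to 146, fewer than the 153 patients reported, and the abstract does not say how the remaining patients were classified.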
Donegan, M. L.; Srivastava, A.; Peake, E.; Swirbul, M.; Ungashe, A.; Rodio, M. J.; Tal, N.; Margolin, G.; Benders-Hadi, N.; Padmanabhan, A.
The goal of this work was to leverage a large corpus of text-based psychotherapy data to create novel machine learning algorithms that can identify suicide risk in asynchronous text therapy. Advances in the field of natural language processing and machine learning have allowed us to include novel data sources as well as use encoding models that can represent context. Our models utilize advanced natural language processing techniques, including fine-tuned transformer models like RoBERTa, to classify risk. Subsequent model versions incorporated non-text data such as demographic features and census-derived social determinants of health to improve equitable and culturally responsive risk assessment, as well as multiclass models that can identify tiered levels of risk. All new models demonstrated significant improvements over our previous model. Our final version, a multiclass model, provides a tiered system that classifies risk as "no risk," "moderate," or "severe" (weighted F1 of 0.85). This tiered approach enhances clinical utility by allowing providers to quickly prioritize the most urgent cases, ensuring a more accurate and timely intervention for clients in need.
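For readers unfamiliar with the metric, a weighted F1 score for a three-tier classifier is the per-class F1 averaged with weights proportional to each class's true frequency. The sketch below computes it from a confusion matrix. The counts are hypothetical and chosen only for illustration; they are not the study's data.

```python
def weighted_f1(confusion, class_names):
    """confusion[i][j] = number of items whose true class is i and predicted class is j.
    Returns the support-weighted F1 and the per-class F1 scores."""
    total = sum(sum(row) for row in confusion)
    per_class = {}
    weighted = 0.0
    for k, name in enumerate(class_names):
        true_positive = confusion[k][k]
        support = sum(confusion[k])                    # items truly in class k
        predicted = sum(row[k] for row in confusion)   # items predicted as class k
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[name] = f1
        weighted += f1 * support / total
    return weighted, per_class

classes = ["no risk", "moderate", "severe"]

# Hypothetical confusion matrix (rows: true class, columns: predicted class).
confusion = [
    [80, 8, 2],
    [6, 30, 4],
    [1, 3, 16],
]

score, per_class = weighted_f1(confusion, classes)
print(f"weighted F1 = {score:.3f}")
for name, value in per_class.items():
    print(f"  {name}: F1 = {value:.3f}")
```

Because the weights follow class frequency, a high weighted F1 can coexist with a noticeably lower F1 on the rarest tier, so per-class scores are worth inspecting alongside the aggregate.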
Kizilaslan, B.; Mehlum, L.
Purpose: Suicide and self-harm are major public health concerns characterized by substantial clinical and psychosocial heterogeneity. While latent class analysis has been used to identify subgroups of people with suicidal behavior, the extent to which such population-level phenotyping complements explainable artificial intelligence-based classification models remains unclear. Methods: We applied latent class analysis to a cross-sectional, publicly available dataset of 1000 individuals presenting with self-harm and suicide-related behaviors at Colombo South Teaching Hospital, Kalubowila, Sri Lanka. Sociodemographic, psychosocial, and clinical variables were used to identify latent subgroups. Class characteristics and suicide prevalence were examined and compared with variable importance patterns reported in a previously published explainable artificial intelligence (XAI)-based suicide classification study using the same dataset. Results: Four latent classes were identified. Two classes exhibited very high suicide prevalence (91.2% [95% CI: 87.7-93.8] and 99.0% [95% CI: 96.4-99.7]), whereas two classes showed low prevalence (<1%). The two high-prevalence classes differed markedly in lifetime psychiatric hospitalization history, with one class showing a 100% prevalence of prior hospitalization and the other substantially lower hospitalization rates. These patterns partially aligned with, and extended beyond, variable importance findings from the XAI-based model. Conclusion: Latent class analysis identified distinct subgroups with substantially different suicide prevalence and clinical profiles, underscoring the heterogeneity of individuals presenting with self-harm. Comparison with the findings of the XAI-based suicide classification model suggests that unsupervised phenotyping and supervised classification provide complementary perspectives, offering population-level context that may enhance the interpretability of suicide assessment frameworks.
Keywords: suicide; self-harm; latent class analysis; explainable artificial intelligence; machine learning
Zhu, T.; Tashevski, A.; Taquet, M.; Azis, M.; Jani, T.; Broome, M. R.; Kabir, T.; Minichino, A.; Murray, G. K.; Nour, M. M.; Singh, I.; Fusar-Poli, P.; Nevado-Holgado, A.; McGuire, P.; Oliver, D.
Psychosis prevention relies on early detection of individuals at clinical high risk for psychosis (CHR-P), but detection remains limited, constraining preventive care. The effectiveness of the CHR-P state is constrained in part because clinical assessments require specialist interpretation of narrative interviews, which limits scalability. Here, we evaluate whether large language models (LLMs; deep learning models trained on large text corpora to process and generate language) can extract clinically meaningful information from such interviews to support psychosis risk assessment. We assessed 11 open-weight LLMs on 678 PSYCHS interview transcripts from 373 participants (77.7% CHR-P). Models inferred CHR-P status and estimated severity and frequency across 15 symptom domains, benchmarked against researcher-rated scores. Larger models achieved the strongest classification performance (Llama-3.3-70B: accuracy = 0.80, sensitivity = 0.93, specificity = 0.58). LLM-generated symptom scores showed good correlations with researcher-rated scores (ICCsev = 0.74, ICCfreq = 0.75). Performance disparities were minimal across most demographic groups but varied across sites. Generated summaries were largely faithful to source transcripts, with low rates of clinically relevant confabulation (3%). Errors primarily reflected over-pathologisation of non-clinical experiences. While accuracy scaled with model size, smaller models achieved competitive performance with substantially lower computational cost. These findings demonstrate that open-weight LLMs can assess psychosis risk from clinical interview transcripts, supporting scalable, human-in-the-loop approaches to early detection.
Bamberger, R.; Kuhles, G.; Lotter, L. D.; Dukart, J.; Konrad, K.; Guenther, T.; Siniatchkin, M.; Fuchs, M.; von Polier, G.
Background: Diagnosis and treatment monitoring of attention-deficit/hyperactivity disorder (ADHD) largely rely on subjective assessments, highlighting the need for objective markers. Voice features and speech embeddings represent promising candidates for such markers, as they may capture alterations in speech production relevant to ADHD. However, it remains unclear which speech features are most informative for distinguishing ADHD and monitoring treatment effects, and which speech tasks most reliably elicit such differences. Methods: Twenty-seven children with ADHD and 27 age-matched neurotypical controls completed six speech tasks across two study visits. Children with ADHD were unmedicated at baseline (first visit) and were assessed under prescribed methylphenidate treatment at follow-up, whereas controls underwent repeated assessment without intervention. Established acoustic voice features (eGeMAPS) and high-dimensional speech embeddings (WavLM, Whisper) were extracted and analysed using linear mixed models to examine baseline group differences and group-by-time interaction effects reflecting medication-associated change patterns. Results: At baseline, children with ADHD differed significantly from controls in frequency, spectral, and temporal voice features, characterized by lower and more variable pitch, altered spectral properties, and reduced rhythmic stability. Group-by-time interaction effects indicated medication-associated modulation in the ADHD group, including reduced loudness variability and increased precision of vowel articulation at follow-up, changes not observed in controls. Speech embeddings revealed additional baseline and interaction effects beyond established acoustic features. Free speech tasks, particularly picture description, yielded the most robust and consistent effects.
Conclusion: Children with ADHD differed from neurotypical controls in vocal features at baseline and showed distinct longitudinal change patterns consistent with medication-related change. These findings support further investigation of speech-based measures as candidate digital phenotypes and potential digital biomarkers in ADHD, with picture description emerging as a particularly promising task for future clinical assessment protocols.
Radlowski Nova, J.; Lopez-Carbonero, J. I.; Corrochano, S.; Ayala, J. L.
Background: Mixed-format lifestyle questionnaires contain both structured variables and free-text responses, but it remains unclear whether language-derived variables provide incremental predictive value beyond structured data, and under which representational condition. This study investigated whether variables derived from patient-reported free text improve ALS-versus-control classification beyond structured questionnaire data, and whether their value depends on how temporal information is represented. Methods: A leakage-free machine-learning pipeline was developed to classify ALS versus controls from questionnaire-derived data, including a schema-guided LLM-based text-to-table extraction and a compact longitudinal encoding strategy. Three feature configurations were compared: Pool1, containing structured baseline variables only; Pool2, adding compact summaries derived from first-time-point (T1) free-text responses; and Pool3, further incorporating compact descriptors of change between T1 and T2. Logistic Regression, linear Support Vector Classification, and Random Forest were evaluated using repeated stratified holdout (10 seeds) and repeated stratified 5-fold cross-validation. Final ablation analyses were performed to isolate the contribution of the compact text block and the compact temporal block. Results: After leakage correction, performance estimates became more conservative, indicating that previous results had been optimistic. In the final configuration, Pool3 achieved the best performance, with Random Forest reaching a holdout accuracy of 0.673, F1-weighted score of 0.666, and Matthews correlation coefficient of 0.323; cross-validated F1-weighted score and Matthews correlation coefficient were 0.654 and 0.312, respectively. Pool2 did not show a robust improvement over Pool1. Ablation analysis showed that removing the compact temporal block markedly reduced Pool3 performance, whereas removing the compact text block had little overall effect.
These findings indicate that the primary value of language-based processing in small clinical cohorts lies not in static feature enrichment, but in enabling compact representations of longitudinal change. Conclusions: In this setting, the main predictive gain did not arise from static text-derived variables alone, but from representing questionnaire information as compact longitudinal change descriptors. These findings suggest that, in small clinical cohorts, the value of language-based processing may lie more in summarizing trajectories than in expanding static feature spaces.
Hossain, M. B.; Yan, R.; Morin, K. A.; Rotenberg, M.; Russolillo, A.; Solmi, M.; Lalva, T.; Marsh, D. C.; Nosyk, B.
Introduction People with bipolar disorder (BD) and concurrent opioid use disorder (OUD) experience more severe clinical outcomes, including higher mortality, treatment complexity, and worse psychiatric symptoms, yet they are underserved due to a lack of tailored clinical guidelines and limited supporting research on competing treatment options. While pharmacological treatments for BD are well-established, their use varies widely across settings, and their effectiveness in individuals with co-occurring OUD is unclear. We propose parallel population-based studies to emulate randomized controlled trials to assess the comparative effectiveness of pharmacological treatment options for BD among people with OUD in British Columbia and Ontario, Canada, 2010-2023. Methods and analysis We propose emulating a series of parallel target trials using linked population-level health administrative data for all individuals aged 18 years or older diagnosed with both BD and OUD and who initiated treatments for BD between 1 January 2010 and 31 December 2023. All analyses will be conducted in parallel in British Columbia and Ontario. We propose a series of four successive target trial emulations, comparing (i) lithium versus non-antipsychotic mood stabilizers such as divalproex, lamotrigine, and valproic acid; (ii) lithium versus 2nd generation antipsychotics with mood stabilizing properties such as risperidone, olanzapine, aripiprazole, and quetiapine; (iii) lithium versus combination treatments such as lithium and divalproex, lithium and olanzapine, lithium and aripiprazole, lithium and quetiapine, divalproex and olanzapine, and olanzapine and quetiapine; (iv) lithium and valproate (LATVAL) versus lithium and olanzapine, lithium and aripiprazole, lithium and quetiapine, divalproex and olanzapine, and olanzapine and quetiapine. Incident user and prevalent new user analyses are planned for proposed target trials (i)-(iv), pending sufficient data. 
Stratified analyses will be conducted for BD-I, manic and depressive phases of BD illness. We propose an initiator analysis (intention-to-treat, conditional on medication dispensation) to determine the effectiveness of the treatments and per-protocol analyses to determine the efficacy of the treatments after dealing with treatment switching and recommended dose adjustment. The outcomes will include psychiatric acute-care visits (hospitalizations and emergency department visits), BD treatment discontinuation and all-cause mortality. Subgroup and sensitivity analyses, including cohort and study timeline restrictions, eligibility criteria modifications, and outcome reclassifications, are proposed to assess the robustness of our results. Executing analyses in parallel across settings using a co-developed protocol will allow us to evaluate the replicability of findings. Ethics and dissemination The protocol, cohort creation, and analysis plan have been classified and approved as a quality improvement initiative by the Providence Health Care Research Ethics Board and the Simon Fraser University Office of Research Ethics. Results will be disseminated to local advocacy groups, clinical groups and decision-makers, national and international clinical guideline developers, presented at international conferences, and published in peer-reviewed journals.
Mallevays, M.; Fuet, L.; Danon, M.; Di Lodovico, L.; Jaffre, C.; Bouzeghoub, L.; Mrad, S.; Rousselet, A.-V.; Allary, L.; Muh, C.; Vissel, B.; De Maricourt, P.; Vinckier, F.; Gaillard, R.; Mekaoui, L.; Gorwood, P.; Petit, A.-C.; Berkovitch, L.
Esketamine is a fast-acting antidepressant drug which induces acute psychoactive effects. The most frequent is a dissociative state which seems unrelated to therapeutic efficacy. Other esketamine-induced effects, including psychedelic-like mystical experiences, have been poorly studied in terms of phenomenology and frequency, and may carry specific therapeutic relevance. In this study, we characterised esketamine-induced mystical experiences in relation to clinical outcomes. We conducted a longitudinal observational study and systematically measured acute subjective effects in patients receiving esketamine for treatment-resistant depression after each administration across the induction phase. A total of 45 patients were included, from two independent centres, totalling 352 esketamine administrations. Principal Component Analysis (PCA) supported the validity of the Mystical Experience Questionnaire (MEQ-30) for assessing esketamine-induced subjective effects, with components recovering dimensions previously validated with classic psychedelics. Mystical experiences (MEQ-30 score above 60) occurred in 58% of patients, with high inter- and intra-individual variability in frequency, intensity, and phenomenology across sessions. Higher mean and peak MEQ scores were associated with greater improvement in Montgomery-Asberg Depression Rating Scale scores from pre- to post-treatment, whereas the intensity of dissociative or other non-mystical effects was not. Positive mood and mystical MEQ dimensions in particular predicted therapeutic outcomes. Baseline spirituality also significantly predicted treatment outcomes and peak MEQ scores in the first week of treatment. These findings add to the growing body of evidence suggesting that psychedelic-like mystical experiences may be associated with therapeutic efficacy, not only in classic psychedelic-assisted therapy, but also in esketamine treatment.
Kim, J. E.; Holbrook, E. B.; Hron, J. D.; Parsons, C. R.
Background: Conversational AI safety systems are primarily evaluated using message-level content monitoring, which assesses inputs and outputs in isolation. This message-by-message approach can miss interaction-level risks that emerge over extended conversations, including patterns discussed in reports of "AI psychosis." Critically, by the time users express overt psychosis-spectrum content, opportunities for intervention may be limited. Objective: We investigated whether LLM responses gradually expand and connect interpretations beyond the user's original concerns, a process we term structural drift. We also tested whether this drift can be detected early and automatically. Methods: We developed an automated, LLM-adapted rubric-based prompt for seven domains of anomalous (psychosis-spectrum) experience, derived from phenomenological psychiatry to capture subtle shifts in subjective interpretation. In Part 1, we evaluated the rubric using gold-standard text excerpts (N = 484) adapted from clinically validated qualitative instruments. In Part 2, we analyzed 1,290 user-LLM response exchanges from 7 dialogues, using 3 different LLMs (5 repeats each), to measure (i) domain amplification (increasing score within a domain) and (ii) domain expansion (new domains appearing over time). Results: Automated scoring showed strong agreement with gold-standard excerpts (domain accuracy 82.7-98.9%; exact 0-3 agreement 63.6-82.7%). Across dialogues, we observed significant amplification in four domains (p < .05; d = 0.14-0.46) and domain expansion in 83.8% of dialogues (88/105; p < .001). Conclusions: AI responses can systematically expand and intensify users' descriptions beyond their initial input. Taken together with the predictive-processing accounts of psychosis, the exposure itself may reinforce maladaptive inferences.
Because drift is detectable from ordinary dialogue without clinical-style probing, this structural drift detection may support scalable, real-time monitoring for emerging risks before overt escalation.
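The denominators in this abstract follow from its design: 7 dialogues run with 3 LLMs and 5 repeats each give the 105 dialogue runs against which domain expansion (88/105) is reported. The sketch below only reproduces that arithmetic from the stated numbers; the average exchanges per run is a derived figure, not one the abstract reports.

```python
# Design factors and counts as stated in the abstract.
n_dialogues = 7
n_models = 3
n_repeats = 5
total_exchanges = 1_290
runs_with_expansion = 88

# Each dialogue is run once per model per repeat.
dialogue_runs = n_dialogues * n_models * n_repeats

expansion_share = runs_with_expansion / dialogue_runs
mean_exchanges_per_run = total_exchanges / dialogue_runs  # derived, not reported

print(f"dialogue runs            = {dialogue_runs}")
print(f"domain expansion share   = {expansion_share:.1%}")
print(f"mean exchanges per run   = {mean_exchanges_per_run:.1f}")
```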
Wickersham, A.; Soneson, E.; Adamo, N.; Colling, C.; Jewell, A.; Downs, J.
Background: A study conducted in Norway showed that the association between pupil mental health diagnoses and educational attainment has weakened over time. One possible explanation is that earlier detection of mental health problems in recent years has facilitated earlier treatment, intervention, and educational support that might improve academic outcomes. We investigated whether the weakening association between mental health and attainment could be replicated in England, and explained by earlier age at first diagnosis. Methods: This was a secondary longitudinal data analysis of de-identified records from a secondary mental healthcare provider in England, which have been linked to the Department for Education's National Pupil Database. We included n=149,841 pupils residing in South East London, born 1993-2003, who completed their end-of-school exams 2009-2019. The main exposure variables were ADHD and internalising disorder diagnosis. In linear regressions, we investigated their associations with Year 11 attainment (typically assessed age 15-16 years), whether this was modified by birth year, and the role of age at first diagnosis. Results: On average, ADHD (n=844, 0.6%) and internalising disorder (n=2,523, 1.7%) were associated with lower Year 11 attainment. However, significant interactions between diagnosis and birth year suggested that pupils with these disorders showed increases in standardised exam scores over successive birth cohorts, resulting in a closing attainment gap over time. While age at first diagnosis became younger over the period, this did not confound the observed associations. Conclusions: We replicated findings from Norway that suggest a narrowing attainment gap between those with and without ADHD and internalising disorder diagnoses. Building on this, we ruled out earlier age of diagnosis as a possible explanation for this phenomenon.
With administrative data research growing internationally, we are increasingly able to replicate mental health and education trends in different countries, opening more opportunities for international collaboration.
Lim, A.; Pemberton, J.
Background: The NHS Improving Access to Psychological Therapies (IAPT) programme, now rebranded as NHS Talking Therapies, faces persistent capacity constraints with average wait times exceeding 90 days for cognitive behavioral therapy (CBT) in many Clinical Commissioning Group areas. AI-powered CBT platforms have been introduced as a digital adjunct within stepped care, yet longitudinal evidence on anxiety symptom trajectories and their predictors in routine NHS settings remains limited. Objective: To model individual anxiety symptom trajectories among patients referred to an AI-powered CBT platform within NHS primary care, identify distinct trajectory classes, and examine patient-level and practice-level predictors of differential treatment response using multilevel growth curve modeling. Methods: A prospective cohort study was conducted using linked clinical and administrative data from 6,284 patients (aged 18-65) referred to the CalmLogic AI-CBT platform across 187 general practices in four NHS England Integrated Care Systems (ICSs) between April 2023 and September 2025. Patients completed GAD-7 assessments at baseline, 4 weeks, 8 weeks, 12 weeks, and 24 weeks. Three-level growth curve models (assessments nested within patients nested within practices) with random intercepts and random slopes were fitted. Growth mixture modeling (GMM) was subsequently applied to identify latent trajectory classes. Predictors were examined at Level 2 (patient demographics, baseline severity, comorbidities, digital literacy, engagement intensity) and Level 3 (practice deprivation index, list size, urban/rural classification, and IAPT wait time). Results: The unconditional growth model revealed a significant average linear decline in GAD-7 scores of -0.94 points per month (p < .001), with substantial between-patient variation in both intercepts (variance = 14.82, p < .001) and slopes (variance = 0.38, p < .001). 
Significant between-practice variation accounted for 8.7% of intercept variance (ICC = 0.087). Growth mixture modeling identified four distinct trajectory classes: Rapid Responders (28.4%, steep early decline stabilising by week 8); Gradual Improvers (34.1%, steady linear decline through 24 weeks); Partial Responders (22.8%, modest early improvement followed by a plateau at clinically significant levels); and Non-Responders (14.7%, minimal change or slight deterioration). Higher baseline severity, female gender, and greater module completion predicted membership in the Rapid Responder class. Practice-level IAPT wait times exceeding 90 days independently predicted faster improvement trajectories (coefficient = -0.31, p = .003), suggesting that AI-CBT has its greatest incremental value in capacity-constrained areas. Patients in the most deprived quintile showed slower trajectories (coefficient = 0.22, p = .011) despite equivalent engagement levels, indicating a deprivation-related treatment response gap. Conclusions: AI-powered CBT platforms integrated within NHS primary care produce significant anxiety symptom reduction on average, but treatment response is heterogeneous, with four distinct trajectory classes identified. The finding that longer IAPT wait times predict better AI-CBT outcomes supports the platform's positioning as a scalable bridge intervention for capacity-constrained services. The deprivation-related response gap warrants targeted support strategies for patients in the most disadvantaged communities.
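The three-level structure described in this abstract (assessments nested within patients nested within practices, with random intercepts and slopes) is commonly written as below. This is a generic textbook formulation offered as a reading aid; the notation, the coding of time, and the assumption that slopes vary at both the patient and practice levels are assumptions, since the abstract does not specify them.

```latex
\begin{aligned}
\text{Level 1 (assessment } t\text{):}\quad
  & Y_{tij} = \pi_{0ij} + \pi_{1ij}\,\mathrm{time}_{tij} + e_{tij} \\
\text{Level 2 (patient } i\text{):}\quad
  & \pi_{0ij} = \beta_{00j} + r_{0ij}, \qquad
    \pi_{1ij} = \beta_{10j} + r_{1ij} \\
\text{Level 3 (practice } j\text{):}\quad
  & \beta_{00j} = \gamma_{000} + u_{00j}, \qquad
    \beta_{10j} = \gamma_{100} + u_{10j}
\end{aligned}
```

Under this notation, $Y_{tij}$ is the GAD-7 score, $\gamma_{100}$ is the average slope (reported as -0.94 points per month), $\operatorname{Var}(r_{0ij})$ and $\operatorname{Var}(r_{1ij})$ are the between-patient intercept and slope variances (reported as 14.82 and 0.38), and the practice-level share of intercept variance (reported as 8.7%) would be $\operatorname{Var}(u_{00j}) / \bigl(\operatorname{Var}(u_{00j}) + \operatorname{Var}(r_{0ij})\bigr)$ if the ICC is defined over intercept variance alone.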
Soman, A.; Dev, S. S.; Ravindren, R.
Background Phonemic awareness deficits are a core feature of Specific Learning Disorder-Reading (SLD-R). How task- and language-specific factors influence these deficits in alphasyllabary languages may help clarify the cognitive mechanisms underlying reading impairment in SLD-R. Methods Thirty children with a DSM-5 diagnosis of SLD-R (mean age 11.4 years) and 29 age-matched typically developing children were given phoneme blending (words and pseudowords) and segmentation tasks in Malayalam. The effects of age and consonant clusters on task performance were evaluated. Results Children with SLD-R performed significantly worse than controls across most phonemic awareness tasks, with the largest deficits observed in pseudoword blending and word blending, and smaller deficits in segmentation. No significant difference was observed for initial phoneme deletion. In typically developing children, age showed strong positive correlations with phonemic performance across most tasks, whereas the SLD-R group showed weak or absent correlations, except in word blending and initial phoneme deletion. Consonant clusters significantly affected performance in both groups, with SLD-R showing more severe deficits. Conclusions Phonemic awareness deficits observed in SLD-R in alphasyllabary languages like Malayalam are more prominent in tasks where lexical support is absent, like pseudoword blending. These deficits vary across task types and linguistic complexity. Phonemic awareness improves with age in typically developing children, while improvement is uneven in children with SLD-R. The findings suggest that phonemic awareness deficits are a core feature of SLD-R across languages, but their manifestation is shaped by orthographic and linguistic characteristics of the writing system.
Grimbly, M. J.; Koopowitz, S.; Chen, R.; Hu, W.; Sun, Z.; Foster, P. J.; Stein, D. J.; Zhu, Z.; Ipser, J. C.
Background: Optical coherence tomography (OCT) is increasingly used to investigate retinal structural changes across neurological and neuropsychiatric conditions. This systematic review and meta-analysis synthesises evidence examining retinal thickness in anxiety, depression, and substance use disorders (SUD) compared with healthy controls. Methods: A pre-registered systematic search (PROSPERO: CRD42024559542) of major databases following PRISMA guidelines was conducted. Case-control studies measuring retinal layer thickness via OCT in adults with DSM or ICD diagnosed anxiety, depression, or SUD were included. Multilevel random-effects models were used to calculate pooled standardised mean differences (SMD) and account for dependencies. Results: Thirty-three studies were included for narrative review, and 25 studies with 145 effect sizes were included for meta-analysis. The primary analysis, which pooled all disorders and effect sizes from available retinal thickness measures, found no significant differences between cases and controls (SMD = -0.20; 95% CI [-0.53, 0.14]; p = .244). Subgroup analyses for anxiety, depression, and SUD also yielded non-significant results (all p > .05). No specific retinal layer was consistently affected, and there was no evidence of an age × diagnosis interaction. Significant heterogeneity (Q = 756.57, p < .001) was present across analyses. Conclusion: This meta-analysis found no significant associations between retinal structural differences and anxiety, depression, or SUD. The field is characterised by high heterogeneity and publication bias, limiting the strength of evidence for the utility of OCT as a reliable biomarker for these conditions. Standardised, large-scale studies are needed with strict controls for confounding factors, including medication, disease stage and ocular parameters, alongside standardised OCT segmentation protocols.
Article Highlights
- First meta-analysis of OCT retinal thickness in anxiety, depression and SUD.
- No significant retinal thickness differences found between cases and healthy controls.
- Age and sex did not moderate the association between diagnosis and retinal thickness.
- High heterogeneity and publication bias limit utility of OCT as a neuropsychiatric biomarker.
- Standardised protocols are needed to clarify retinal changes in psychiatric research.
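As a quick plausibility check on the pooled estimate reported above, the standard error and an approximate two-sided p-value can be back-calculated from the point estimate and its 95% confidence interval. This is a minimal sketch assuming a normal (Wald) approximation; the function name is our own, and the authors' multilevel model may have used a t-distribution, so exact agreement with the reported p = .244 is not expected.

```python
import math

def p_from_ci(estimate, ci_low, ci_high, z_crit=1.959964):
    """Back out SE, z, and a two-sided p-value from a symmetric 95% CI.

    Assumes a normal approximation (an assumption, not the paper's method).
    """
    se = (ci_high - ci_low) / (2 * z_crit)   # CI half-width divided by 1.96
    z = estimate / se
    # Two-sided tail probability: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p

# Values taken from the abstract: SMD = -0.20, 95% CI [-0.53, 0.14]
se, z, p = p_from_ci(-0.20, -0.53, 0.14)
print(f"SE = {se:.3f}, z = {z:.2f}, p = {p:.3f}")
```

The result lands at roughly p = 0.24, in line with the reported non-significant finding.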
Ferreira, C.; Lim, A.
Background: AI-powered cognitive behavioral therapy (CBT) chatbots represent a scalable approach to addressing the global mental health treatment gap. However, causal evidence on their population-level effectiveness in low- and middle-income countries (LMICs) remains limited, and patient perspectives on acceptability and engagement are critical determinants of sustained use. Brazil's Estrategia de Saude da Familia (ESF) deployed an AI-powered CBT chatbot, Saude Mental Digital (SMD), to registered patients aged 18 and older at participating primary care units, with eligibility determined by a composite vulnerability score exceeding a predetermined threshold. Objective: To estimate the causal effect of AI-powered CBT chatbot access on anxiety and depressive symptoms among primary care patients in Minas Gerais, Brazil, leveraging the eligibility score threshold as an exogenous source of variation. Methods: We conducted a fuzzy regression discontinuity design (fuzzy RDD) study using linked administrative and clinical data from 312 ESF primary care units across Minas Gerais (N = 43,287 patients; January 2022 to December 2024). The running variable was the composite vulnerability score, with a threshold of 60 points determining chatbot eligibility. The primary outcome was the 12-week change in the Patient Health Questionnaire Anxiety and Depression Scale (PHQ-ADS) composite score. Two-stage least squares (2SLS) estimation was used with local polynomial regression and triangular kernel weighting. Bandwidth selection followed the Calonico-Cattaneo-Titiunik (CCT) optimal procedure. Results: The fuzzy RDD estimated a local average treatment effect (LATE) of -4.73 points (95% CI: -6.91 to -2.55; p < 0.001) on the PHQ-ADS composite score at the eligibility threshold, indicating clinically meaningful symptom reduction among compliers. First-stage estimates confirmed a strong 31.2 percentage point jump in chatbot uptake at the threshold (F statistic 1274). Subgroup analyses revealed larger treatment effects among patients in rural municipalities (-6.18; 95% CI: -9.02 to -3.34), those with lower educational attainment (-5.82; 95% CI: -8.44 to -3.20), and women (-5.37; 95% CI: -7.61 to -3.13). McCrary density tests confirmed no evidence of running variable manipulation (p = 0.67). Results were robust across alternative bandwidths, polynomial orders, and kernel specifications. Conclusions: AI-powered CBT chatbot access causally reduces anxiety and depressive symptoms among primary care patients near the eligibility threshold in Brazil, with particularly pronounced benefits for rural, less educated, and female populations. These findings provide quasi-experimental evidence supporting the scalable deployment of AI-powered CBT tools within public primary care systems in LMICs, while underscoring the importance of incorporating patient perspectives on acceptability to maximize engagement and sustained therapeutic benefit.
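The relationship between the first stage and the LATE in a fuzzy RDD can be illustrated with the Wald ratio: the jump in the outcome at the threshold divided by the jump in treatment uptake. The sketch below is illustrative only. It reads the abstract's estimates as a LATE of -4.73 points and a first-stage jump of 31.2 percentage points (decimal placement is an assumption where the listing has lost punctuation), the function names are hypothetical, and the implied reduced-form jump is derived arithmetic, not a figure the authors report.

```python
def wald_late(reduced_form_jump: float, first_stage_jump: float) -> float:
    """Fuzzy-RDD Wald ratio: outcome discontinuity / uptake discontinuity."""
    if first_stage_jump == 0:
        raise ValueError("first stage must be non-zero (no jump in uptake)")
    return reduced_form_jump / first_stage_jump

def implied_reduced_form(late: float, first_stage_jump: float) -> float:
    """Invert the Wald ratio to get the implied outcome jump at the cutoff."""
    return late * first_stage_jump

# Assumed readings of the abstract's figures (see lead-in).
late = -4.73          # PHQ-ADS points, effect among compliers
first_stage = 0.312   # jump in chatbot uptake at the threshold (proportion)

itt_jump = implied_reduced_form(late, first_stage)
print(f"Implied outcome jump at threshold: {itt_jump:.2f} PHQ-ADS points")
print(f"Recovered LATE: {wald_late(itt_jump, first_stage):.2f}")
```

The point of the exercise is that the complier effect is larger in magnitude than the average discontinuity in the outcome, because only a fraction of patients change their uptake at the threshold.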
Umar, M.; Hussain, F.; Khizar, B.; Khan, I.; Khan, F.; Cotic, M.; Chan, L.; Hussain, A.; Ali, M. N.; Gill, S. A.; Mustafa, A. B.; Dogar, I. A.; Nizami, A. T.; Haq, M. M. u.; Mufti, K.; Ansari, M. A.; Hussain, M. I.; Choudhary, S. T.; Maqsood, N.; Rasool, G.; Ali, H.; Ilyas, M.; Tariq, M.; Shafiq, S.; Khan, A. A.; Rashid, S.; Ahmad, H.; Bettani, K. U.; Khan, M. K.; Choudhary, A. R.; Mehdi, M.; Shakoor, A.; Mehmood, N.; Mufti, A. A.; Bhatia, M. R.; Ali, M.; Khan, M. A.; Alam, N.; Naqvi, S. Q.-i.-H.; Mughal, N.; Ilyas, N.; Channar, P.; Ijaz, P.; Din, A.; Agha, H.; Channa, S.; Ambreen, S.; Rehman,
Background: Major depressive disorder (MDD), a leading cause of disability worldwide, exhibits substantial heterogeneity in treatment outcomes. Patients who do not respond to standard antidepressant therapy account for the majority of MDD's disease burden. Several risk factors have been implicated in treatment response, including genes that affect how antidepressants are metabolised. Yet, despite their clinical importance, risk factors for treatment-resistant depression (TRD) remain unexplored in low- and middle-income countries (LMIC). We used data from the DIVERGE study on MDD to investigate the risk factors of TRD in Pakistan. Methods: DIVERGE is a genetic epidemiological study that recruited adult MDD patients (≥18 years) between Sep 27, 2021 and Jun 30, 2025, from psychiatric care facilities across Pakistan. Detailed phenotypic information was collected by trained interviewers and blood samples were taken. The Infinium Global Diversity Array with Enhanced PGx-8 from Illumina was used for genotyping, followed by DRAGEN calling to infer metaboliser phenotypes for Cytochrome P450 (CYP) enzyme genes. We defined TRD as minimal to no improvement after ≥12 weeks of adherent antidepressant therapy. We conducted multi-level logistic regression to test the association of demographic, clinical and pharmacogenetic variables with TRD. Findings: Among 3,677 eligible patients, polypharmacy was rampant; 86% were prescribed another psychotropic drug along with an antidepressant. Psychological therapies were uncommon (6%), while 49% of patients had previously visited a religious leader/faith healer in relation to their mental health problems. TRD was experienced by 34% (95%CI: 32-36%) of patients. The TRD group was characterised by more psychotic symptoms and suicidal behaviour (OR=1.39, 95%CI=1.04-1.84, p=0.02; OR=1.03, 95%CI=1.01-1.05, p=0.005, respectively). 
Social support (OR=0.55, 95%CI=0.44-0.69, p=1.4×10⁻⁷) and parents being first cousins (OR=0.81, 95%CI=0.69-0.96, p=0.01) were associated with lower odds of TRD. In 1,085 patients with CYP enzyme data, poor (OR=1.85, 95%CI=1.11-3.07, p=0.01) and ultra-rapid (OR=3.11, 95%CI=1.59-6.12, p=0.0009) metabolisers for CYP2C19 had increased risk of TRD compared with normal metabolisers. Interpretation: Polypharmacy was used excessively in the treatment of depression while psychological therapies were uncommon, highlighting the need for more evidence-based practice. This first large study of MDD from Pakistan uncovered the importance of culture-specific forms of social support in preventing TRD, highlighting opportunities for interventions in low-income settings. Pharmacogenetic markers can be leveraged to predict TRD.
Jacobsen, A. M.; Quednow, B. B.; Bavato, F.
Importance: Blood neurofilament light chain (NfL) and glial fibrillary acidic protein (GFAP) are entering clinical use in neurology as markers of neuroaxonal and astrocytic injury, but their utility in psychiatry is unclear. Objective: To determine whether psychiatric diagnoses are associated with altered plasma NfL and GFAP levels. Design, Setting, and Participants: This population-based study examined plasma NfL and GFAP among 47,495 participants from the UK Biobank (54.0% female; 93.5% White; mean [SD] age 56.8 [8.2] years) who provided blood samples and sociodemographic and clinical data between 2006 and 2010. Normative modeling was applied to assess associations between 7 lifetime psychiatric diagnostic categories and deviations from expected NfL and GFAP levels, while accounting for neurological diagnoses, cardiometabolic burden, and substance use. Data were analyzed between July 2025 and March 2026. Main Outcomes and Measures: Deviations in plasma NfL and GFAP levels from normative predictions. Results: Relative to the reference population, plasma NfL levels were higher among individuals with bipolar disorder (d=0.20; 95% CI, 0.03-0.37; p=0.03), recurrent depressive disorder (d=0.23; 95% CI, 0.07-0.38; p=0.009), and depressive episodes (d=0.06; 95% CI, 0.02-0.10; p=0.01), lower among individuals with anxiety disorders (d=-0.07; 95% CI, -0.12 to -0.02; p=0.008), but did not differ in schizophrenia spectrum, stress-related, or other psychiatric disorders. Plasma GFAP levels were not elevated in any psychiatric disorders. Variability in NfL levels was greater among individuals with schizophrenia spectrum disorders (variance ratio [VR]=1.30; p=0.005), depressive episodes (VR=1.06; p=0.006), and anxiety disorders (VR=1.08; p=0.005). Variability in GFAP levels was increased only in anxiety disorders (VR=1.08; p=0.01). 
Plasma NfL levels exceeding percentile-based normative thresholds were more common among individuals with schizophrenia spectrum disorders, bipolar disorder, recurrent depressive disorder, and depressive episodes. Neurological diagnoses, cardiometabolic burden, and substance use were associated with plasma NfL and GFAP levels. Conclusions and Relevance: This study provides population-level evidence of plasma NfL elevation in bipolar and depressive disorders and increased variability in schizophrenia spectrum, bipolar and depressive disorders, supporting its potential as a biomarker in psychiatry and informing its ongoing neurological applications. Plasma GFAP levels, in contrast, were largely unaltered across psychiatric disorders. Key Points. Question: Are plasma neurofilament light chain (NfL) and glial fibrillary acidic protein (GFAP) levels altered in psychiatric disorders? Findings: In this cohort study including 47,495 individuals, normative modeling revealed that plasma NfL levels were elevated in bipolar and depressive disorders, whereas plasma GFAP levels were not elevated in any psychiatric disorder. Plasma NfL levels also showed higher variability in schizophrenia spectrum, bipolar, and depressive disorders. Meaning: Plasma NfL shows distinct alterations in schizophrenia spectrum and affective disorders, supporting its further investigation as a biomarker in clinical psychiatry and highlighting the need to consider psychiatric comorbidity in neurological applications.
Williamson, G.; Carr, E.; Varghese, R.; Dymond, S.; King, K.; Simms, A.; Goodwin, L.; Murphy, D.; Leightley, D.
Background: Alcohol misuse is common in the UK Armed Forces (AF) community, with prevalence higher than in the general population. To date, digital health initiatives to address alcohol misuse have largely focused on men, who represent around 88% of the UK AF. However, women who have served in the UK AF also drink disproportionately more than women in the general population. Objective: This two-arm participant-blinded (single-blinded) confirmatory randomized controlled trial (RCT) aimed to assess the efficacy of a brief alcohol intervention (DrinksRation) compared to a web application which included NHS-focused drinking advice (BeAlcoholSmart) in reducing weekly self-reported alcohol consumption between baseline and 84-day follow-up among women who have served in the UK AF. Methods: A smartphone app (DrinksRation) was compared with government guidance on alcohol use. The app included features tailored to the needs of women who have served and was designed to enhance motivation to reduce alcohol consumption. The trial enrolled women who had completed at least one day of paid service in the UK Armed Forces. Recruitment, consent, and data collection were completed automatically through the platform. The primary outcome was the between-group difference in change in self-reported weekly alcohol consumption from baseline to day 84, measured using the Timeline Follow-Back method. The secondary outcome was the between-group difference in change in Alcohol Use Disorders Identification Test (AUDIT) score from baseline to day 84. Process evaluation outcomes included app engagement and usability, with usability assessed using the mHealth App Usability Questionnaire. Results: A total of 88 women UK AF veterans were included in the final analysis (control=37; intervention=51). 
At 84 days post-baseline, participants in the intervention group (DrinksRation) showed a greater reduction in weekly alcohol consumption than controls (BeAlcoholSmart) (adjusted mean difference in change from baseline = -11.6 units; 95% CI: -19.7 to -3.6; p=0.005). AUDIT scores also decreased more in the intervention group (adjusted mean difference in change = -3.9; 95% CI: -6.9 to -1.0; p=0.01). Usability scores at day 28 were significantly higher for the intervention group across all domains. No serious adverse events or technical issues were reported. Conclusions: DrinksRation reduced alcohol consumption and hazardous drinking among women who have served in the UK Armed Forces. Engagement was strong, usability was high, and no safety concerns were identified. These findings support the potential of tailored digital interventions to address alcohol misuse in this population. Registration: ClinicalTrials.gov NCT05970484.